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Abstract 

This paper traces the strong relations between experimental design and control, such as the use of optimal inputs to obtain 
precise parameter estimation in dynamical systems and the introduction of suitably designed perturbations in adaptive control. 
The mathematical background of optimal experimental design is briefly presented, and the role of experimental design in the 
asymptotic properties of estimators is emphasized. Although most of the paper concerns parametric models, some results are 
also presented for statistical learning and prediction with nonparametric models. 
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1 Introduction 

The design of experiments (DOE) is a well developed 
methodology in statistics, to which several books have 
been dedicated, see e.g. [42], [167], [125], [4], [149], [44]. 
See also the series of proceedings of the Model-Oriented 
Design and Analysis workshops (Springer Verlag 1987; 
Physica Verlag, 1990, 1992, 1995, 1998, 2001, 2004). Its 
application to the construction of persistently exciting 
inputs for dynamical systems is well known in control 
theory (see Chapter 6 of [58], Chapter 14 of [104], Chap- 
ter 6 of [188], the book [196] and the recent surveys [53], 
[66]). A first objective of this paper is to briefly present 
the mathematical background of the methodology and 
make it accessible to a wider audience. DOE, which can 
can be apprehended as a technique for extracting the 
most useful information from data to be collected, is 
thus a central (and sometimes hidden) methodology in 
every occasion where unknown quantities must be esti- 
mated and the choice of a method for this estimation is 
open. DOE may therefore serve different purposes and 
happens to be a suitable vehicle for establishing links 
between problems like optimization, estimation, predic- 
tion and control. Hence, a second objective of the paper 
is to exhibit links and similarities between seemingly dif- 
ferent issues (for instance, we shall see that parameter 
estimation and prediction of a model response are es- 
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sentially equivalent problems for parametric models and 
that the construction of an optimal method for global 
optimization can be casted as a stochastic control prob- 
lem) . At the same time, attention will be drawn to fun- 
damental differences that exist between seemingly simi- 
lar problems (in particular, evidence will be given of the 
gap between using parametric or nonparametric models 
for prediction). A third objective is to point out and ex- 
plain some inherent difficulties in estimation problems 
when combined with optimization or control (hence we 
shall see why adaptive control is an intrinsically difficult 
subject), indicate some tentative remedies and suggest 
possible developments. 

Mentioning these three objectives should not shroud the 
main message of the paper, which consists in pointing out 
prospective research directions for experimental design in 
relation with control, in short: classical DOE relies on the 
assumption of persistence of excitation but many issues 
remain open in other situations; DOE should be driven 
by the final purpose of the identification (the intended 
model application of [57] ) and this should be reflected in 
the construction of design criteria; DOE should face the 
new challenges raised by nonparametric models and ro- 
bust control; algorithms and practical methods for DOE 
in non-standard situations are still missing. The program 
is rather ambitious, and this survey does not pretend to 
be exhaustive (for instance, only the case of scalar ob- 
servations is considered; Bayesian techniques are only 
slightly touched; measurement errors are assumed to be 
independent, although correlated errors would deserve 
a special treatment; distributed parameter systems are 



Preprint submitted to Automatica 



3 March 2008 



not considered; nonparametric modelling is briefly con- 
sidered and for static systems only, etc.). However, ref- 
erences are indicated where a detailed enough presen- 
tation is lacking. None of the results presented is really 
new, but their collection in a single document probably 
is, and will hopefully be useful to the reader. 

Section 2 presents different types of application of opti- 
mal experimental design, partly through examples, and 
serves as an introduction to the topic. In particular, the 
fourth application concerns optimization and forms a 
preliminary illustration of the link between sequential 
design and adaptive control. Section 3 concerns statisti- 
cal learning and nonparametric modelling, where DOE 
is still at an early stage of development. The rest of the 
paper mainly deals with parametric models, for which 
parameter uncertainty is suitably characterized through 
information matrices, due to the asymptotic normality of 
parameter estimators and the Cramer-Rao bound. This 
is considered in Section 4 for regression models. Section 
5 presents the mathematical background of optimal ex- 
perimental design for parameter estimation. The case of 
dynamical models is considered in Section 6, where the 
input is designed to yield the most accurate estimation 
of the model parameters, while possibly taking a robust- 
control objective into account. Section 7 concerns adap- 
tive control: the ultimate objective is process control, 
but the construction of the controller requires the esti- 
mation of the model parameters. The difficulties are il- 
lustrated through a series of simple examples. Optimal 
DOE yields input sequences that are optimally (and per- 
sistently) exciting. At the same time, by focussing at- 
tention on parameter estimation, it reveals the intrinsic 
difficulties of adaptive control through the links between 
dual (active) control and sequential design. General se- 
quential design (for static systems) is briefly considered 
in Section 8. Finally, Section 9 suggests further devel- 
opments and research directions in DOE, concerning in 
particular active learning and nonlinear feedback con- 
trol. Here also the presentation is mainly through exam- 
ples. 

2 Examples of applications of DOE 

Although the paper is mainly dedicated to parameter 
estimation issues, DOE may have quite different objec- 
tives (and it is indeed one of the purposes of the pa- 
per to use DOE to exhibit links relating these objec- 
tives) . They are illustrated through examples which also 
serve to progressively introduce the notations. The first 
one concerns an extremely simple parameter estimation 
problem where the benefit of a suitably designed exper- 
iment is spectacular. 

2.1 A weighing problem 

Suppose we wish to determine the weights of eight ob- 
jects with a chemical balance. The result y of a weigh- 



ing (the observation) corresponds to the mass on the left 
pan of the balance minus the mass on the right pan plus 
some measurement error e. The errors associated with a 
series of measurements are assumed to be independently 
identically distributed (i.i.d.) with the normal distribu- 
tion Af(Q, a 2 ). The objects have weights i — 1, . . . , 8. 
Each weighing is characterized by a 8-dimensional vec- 
tor u with components {u}i equal to 1, —1 or depend- 
ing whether object i is on the left pan, the right pan or 
is absent from the weighing, and the associated observa- 
tion is y = u T 8 + e. We thus have a linear model (in the 
statistical sense: the response is linear in the parameter 

vector 8), and the Least-Squares (LS) estimator 8 as- 
sociated with N observations yu characterized by exper- 
imental conditions (design pointfH) life, k = 1, . . . , N, is 

N N 

8 N = argmin^[t/ fe - uj8} 2 = M^ 1 J]y fc u fc , (1) 

" k=l k=l 
N 

with M w = u fc u fe ■ (2) 

k=l 

We consider two weighing methods. In method a the 
eight objets are weighed successively: the vectors for 
the eight observations coincide with the basis vectors 
ej of K 8 and the observations are y\ — di + Si, i = 
1, . . . , 8. The estimated weights are simply given by the 
observations, that is, 0, = yi ~ Af(9i,a 2 ). Method b 
is slightly more sophisticated. Eight measurements are 
performed, each time using a different configuration of 
the objets on the two pans so that the vectors form a 
8x8 Hadamard matrix (|{uj}j| = 1 Vi, j and u^u,- = 
Vi j, i,j = 1,...,8). The estimates then satisfy 
&i ~ Af(0i,a 2 /8) with 8 observations only. To obtain 
the same precision with method a, one would need to 
perform eight independent repetitions of the experiment, 
requiring 64 observations in totaPI. 

In a linear model of this type, the LS estimator (1) is 

unbiased: lEg{8 } — 8 = 0, where IEgj-} denotes the 
mathematical expectation conditionally to 8 being the 
true vector of unknown parameters. Its covariance ma- 
trix is TE e {(8 N -8)(8 N -8) T } = ^M^ 1 with Mat given 
by (2) (note that it does not depend on 9) . Choosing an 



Although design points and experimental variables are 
usually denoted by the letter x in the statistical literature, 
we shall use the letter it due to the attention given here 
to control problems. In this weighing example, denotes 
the decisions made concerning the k-ih observation, which 
already reveals the contiguity between experimental design 
and control. 

2 Note that we implicitly assumed that the range of the 
instrument allows to weigh all objects simultaneously in 
method b. Also note that the gain would be smaller when 
using method b if the variance of the measurement errors 
increased with the total weight on the balance. 
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experiment that provides a precise estimation of the pa- 
rameters thus amounts to choosing N vectors such 
that (M jv is non singular and) "M^ 1 is as small as pos- 
sible" , in the sense that a scalar function of M^ 1 is min- 
imized (or a scalar function of Mjy is maximized), see 
Section 5. In the weighing problem above the optimiza- 
tion problem is combinatorial since {uk}i € { — 1,0,1}. 
In the design of method b the vectors optimize most 
"reasonable" criteria $(M N ), see, e.g., [29], [162]. This 
case will not be considered in the rest of the paper but 
corresponds to a topic that has a long and rich history 
(it originated in agriculture through the pioneering work 
of Fisher, see [46]). 

2.2 An example of parameter estimation in a dynamical 
model 

The example is taken from [39] and concerns a so-called 
compartment model, widely used in pharmacokinetics. 
A drug x is injected in blood (intravenous infusion) with 
an input profile u(t), the drug moves from the central 
compartment C (blood) to the peripheral compartment 
P, where the respective quantities of drugs at time t 
are denoted xc(t) and xp(t). These obey the following 
differential equations: 

f ^IT 1 = {-Kel~ K CP )x c {t) + K PC xp(t)+u(t) 
I ^ = K CP xc{t) - K PC xp{t) 

where Kcp 1 Kpc and Ke l are unknown parameters. 
One observes the drug concentration in blood, that is, 
y(t) = xc(t)/V + e(t) at time t, where the errors eifi) 
corresponding to different observations are assumed to 
be i.i.d. 7V(0, ex 2 ) and where V denotes the (unknown) 
volume of the central compartment. There are thus four 
unknown parameters to be estimated, which we denote 
= (K C p, Kpc, K E l, V). The profile of the input u(t) 
is imposed: it consists of a 1 min loading infusion of 75 
mg/min followed by a continuous maintenance infusion 
of 1.45 mg/min. The experimental variables correspond 
to the sampling times tj, 1 < t» < 720 min (the time in- 
stants at which the observations — blood samples — are 
taken). Suppose that the true parameters take the values 
= (0.066mm" 1 , 0.038mm" 1 , 0.0242 mm" 1 , 301). 
Two different experimental designs are considered. 
The first one, called "conventional" , is given by 
t = (5,10,30,60, 120, 180, 360,720) (in min); the "op- 
timal" one (D-optimal for 0, see Section 5.1) is 
t* = (1,1,10,10,74,74,720,720) (in min). (Note that 
both designs contain 8 observations and that t* com- 
prises repetitions of observations at the same time 
- which means that it is implicitly assumed that the 
collection of several simultaneous independent measure- 
ments is possible.) Figure 1 presents the (approximate) 
marginal density for the LS estimator of K el, see [129], 
[139], when a = 0.2/xg/ml. Similar pictures are obtained 
for the other parameters. 
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Fig. 1. Approximate marginal densities for the LS estimator 
Kel (dashed line for the conventional design, solid line for 
the optimal one); the true value is Kel = 0.0242 min -1 . 

Clearly, the "optimal" design t* yields a much more pre- 
cise estimation of than the conventional one, although 
both involve the same number of observations. On the 
other hand, with 4 = dim(0) sampling times only, t* 
does not permit to test the validity of the model. DOE 
for model discrimination, which we consider next, is es- 
pecially indicated for situations where one hesitates be- 
tween several structures. 

2.3 Discrimination between model structures 

Design for discrimination between model structures will 
not be detailed in the paper, only the basic principle of 
a simple method is indicated below and one can refer to 
[17] and the survey papers [3], [65] for other approaches. 
When there are two model structures rp-\6\,u) and 
rj( 2 \02,u) and the errors are i.i.d., a simple sequential 
procedure is as follows, see [5]: 

• after the observation of y(u\), . . . ,y(\ik) estimate 1 

-k 

and 2 for both models; 

k *-k 

• place next point Ufc+i where [7/ 1 ) (0 1 , u) — r/ 2 ' (0 2 , u)] 2 
is maximum; 

• k — ► k + 1, repeat. 

When there are more than two structures in competition, 

one should estimate i for all of them and place the next 
point using the two models with the best and second 
best fitting, see [6] . The idea is to place the design point 
where the predictions of the competitors differ much, 
so that when one of the structures is correct (which is 
the underlying assumption) , next observation should be 
close to the prediction of that model and should thus give 
evidence that the other structures are wrong. Similar 
ideas can be used to design input sequences for detecting 
changes in the behavior of dynamical systems, see the 
book [82]. 

2.4 Optimization of a model response 

Suppose that one wishes to maximize a function rj(0, u) 
with respect to u e M. d , with e W a vector of un- 
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known parameters. When a value is proposed, the 
function is observed through yi = y(ui) = r](9, Uj) + £j 
with Ei a measurement error. Since the problem is to de- 
termine u* = u*(9) = argmax u r)(6, u), it seems natu- 
ral to first estimate 9 = 9[y] from a vector of observa- 
tions y = [yi, . . . , yjv] T and then predict the optimum 
by u*(0). The question is then which values to use for 
the Uj's for estimating 9, that is, which criterion to opti- 
mize for designing the experiment? It could be (i) based 
on the precision of 0, or (ii) based on the precision of 
u*(9), or, preferably, (hi) oriented towards the final ob- 
jective and based on the cost C{9\9) of using 9 when 
the true value of 9 is 9. A possible choice is C(9\9) = 
r][0,u*(0)]-ri[0,u*(0)] > 0, which leads to a design that 
minimizes the Bayesian risk R = TE{C(9[y]\9)}, where 
the expectation is with respect to y and 9 for which a 
prior distribution tt(-) is assumed, see, e.g., [144] (see also 
[27] and the book [132] for a review of Bayesian DOE). 

The approaches (i-iii) above are standard in experimen- 
tal design: optimization is performed in two steps, first 
some design points Uj's are selected for estimation, sec- 
ond 9 is estimated and used to construct u*(9). How- 
ever, in general each response 7/(0, Uj) is far from the 
maximum i][9, u*(9)} (since the explicit objective of the 
design is estimation, not maximization) while in some 
situations it is required to have rj{9, u^) as large as pos- 
sible for every i, that is, close to u*(0), which is un- 
known. A sequential approach is then natural: try u i} 

observe yi, estimate 9 = 9(y\), suggest u^ + i and so 
on. . . (Notice that this involves a feedback of informa- 
tion in the sequence of design points — the control se- 
quence — and thus induces a dynamical aspect although 
the initial problem is purely static.) Each has_two 
objectives: help to estimate 9, try to maximize r](9, u). 
The design problem thus corresponds to a dual control 
problem, to be considered in Section 7.4. When no para- 
metric form is known for the function to be maximized, 
it is classical to resort to suboptimal methods such as 
the Kiefer-Wolfowitz scheme [83], or the response sur- 
face methodology which involves linear and quadratic 
approximations, see, e.g., [18]. Optimization with a non- 
parametric model will be considered in Section 9, com- 
bining statistical learning with global optimization. 

3 Statistical learning, nonparametric models 

One can refer to the books [183], [184], [62] and the 
surveys [37], [10] for a detailed exposition of statisti- 
cal learning. Based on so-called "training data" V = 
{[ui, y(ui)], . . . , [ujv, y(ujv)]} we wish to predict the re- 
sponse y(u) of a process at some unsampled input u us- 
ing Nadaraya- Watson regression [118], [189], Radial Ba- 
sis Functions (RBF), Support Vector Machine (SVM) 
regression or Kriging (Gaussian process). All these ap- 
proaches can be castcd in the class of kernel methods, 



see [185], [186] and [159] for a more precise formulation, 
and we only consider the last one, Kriging, due to its 
wide flexibility and easy interpretability. The associated 
DOE problem is considered in Section 3.2. We denote 
yr>(u) the prediction at u and y = [y(ui), . . . ,y(ujv)] T . 

3. 1 Gaussian process and Kriging 

The method originated in geostatistics, see [86], [107], 
and has a long history. When the modelling errors con- 
cern a transfer function observed in the Nyquist plane, 
the approach possesses strong similarities with the so- 
called "stochastic embedding" technique, see, e.g., [59] 
and the survey paper [124]. The observations are mod- 
elled as y(uk) = 9o + P{\ik,oj) + £&, where P(u,u>) de- 
notes a second-order stationary zero-mean random pro- 
cess with covariance !E{P(u, u))P(z, ui)} = K(u, z) = 
opC(u — z) and the £fc's are i.i.d., with zero mean and 
variance a 2 . The best linear unbiased predictor at u is 
2/x>(u) = v T (u)y, where v(u) minimizes IE{(v T y— [9q + 
P(u,uj)]) 2 } with the constraint !E{v T y} ^^oEili"' = 
IE{y(u)} = (9o, that is, X^iLi v i = 1- This optimization 
problem is solvable explicitly, which gives 



jto(u) = v T (u)y = §"+ c T (u)C- 1 (y - 6°1) 



(3) 



where C y = ct 2 Iat + o-pCp with Ijv the TV-dimensional 
identity matrix and Cp the N x N matrix defined by 
[Cpjjj = C(uj — Uj), 1 is the TV-dimensional vector with 
components 1, c(u) = op[C(u — ui), . . . , C(u — ujv)] t 
and 6 a = (l T C J 7 1 y)/(l T Cj7 1 l) (a weighted LS esti- 
mator of 9 ). Note that the prediction takes the form 

7/x>(u) = 9° + X^feLi a kK(u, Ufe), i.e., a linear combina- 
tion of kernel values. The Mean-Squared-Error (MSE) 
of the prediction y-p(u) at u is given by 



pl(u)=a 2 P - c T (u) 1 



Cy 


1 


-1 


'c(u)" 


1 T 







1 



(4) 



and, if a 2 = (i.e., there are no measurement errors £&), 
yx>(uj) = y{ui) and p%(vii) = for any i. The predic- 
tor 2/x>(u) is then a perfect interpolator. This method 
thus makes statistical inference possible even for purely 
deterministic systems, the model uncertainty being rep- 
resented by the trajectory of a random process. Since 
the publication [157] it has been successfully applied in 
many domains of engineering where simulations (com- 
puter codes) replace real physical experiments (and mea- 
surement errors are thus absent), see, e.g., [158]. 

If the characteristics of the process P(u,u>) and errors 
£fc belong to a parametric family, the unknown parame- 
ters that are involved can be estimated. For instance, for 
a Gaussian process with C(z) parameterized as C(z) = 
C((3, z) and for normal errors £&, the parameters 0, dp 
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and a 2 can be estimated by Maximum Likelihood; see 
the book [173], in particular for recommendations con- 
cerning the choice of the covariance function C(z). See 
also the survey [105] and the papers [195], [181] con- 
cerning the asymptotic properties of the estimator. The 
method can be extended in several directions: the con- 
stant terms 6*o can be replaced by a linear model r T (u)8 
(this is called universal Kriging, or intrinsic Kriging 
when generalized covariances are used, which is then 
equivalent to splines, see [187]), a prior distribution can 
be set on 6 (Bayesian Kriging, see [38]), the derivative 
(gradient) of the response y(u) can also be predicted 
from observations y(uk), see [185], or observations of 
the derivatives can be used to improve the prediction of 
the response, see [114], [106], [102]. Nonparametric mod- 
elling can be used in optimization, and an application of 
Kriging to global optimization is presented in Section 9. 

3.2 DOE for nonparametric models 

The approaches can be classified among those that are 
model-free (of the space-filling type) and those that use 
a model. 

3.2.1 Model- free design (space filling) 

For U the design set (the admissible set for u), we call 
SS C U the finite set of chosen design points or sites 
Ufc where the observations are made, k = 1,...,N. 
Maximin- distance design [78] chooses sites SS that 
maximize the minimum distance between points of SS, 
i.e. min U7 i u / e 552 d(u, u'). The chosen sites Ufc are thus 
maximally spread in hi (in particular, some points are 
set on the boundary of U). When U is a discrete set, 
minimax- distance design [78] chooses sites that mini- 
mize the maximum distance between a point in U and 
SS, i.e. max zeW d(z, SS) = max z£U min ueS s d(z, u). 
In order to ensure good projection properties in all 
directions (for each component of the u^'s), it is rec- 
ommended to work in the class of latine hypercube 
designs, see [113] (when U is scaled to [0, l] d , for every 
i = 1, . . . , d the components {uk}i, k — 1, . . . , N, then 
take all the values 0, 1/(N- 1), 2/(N - 1), . . . , 1). 

3.2.2 Model-based design 

In order to relate the choice of the design to the quality 
of the prediction yrj(u), a first step is to characterize the 
uncertainty on j/£>(u). This raises difficult issues in non- 
parametric modelling, in particular due to the difficulty 
of deriving a global measure expressing the speed of de- 
crease of the MSE of the prediction as AT, the number of 
observations, increases (we shall see in Section 5.3.4 that 
the situation is opposite in the parametric case) . A rea- 
son is that the effect of the addition of a new observation 
is local: when we observe at u, the MSE of the prediction 
at z decreases for z close to u (for instance, for Kriging 
without measurement errors pv(u) becomes zero), but 



is weakly modified for z far from u. Hence, DOE is often 
ignored in the statistical learning literatur^], where the 
set of training data V is generally assumed to be a col- 
lection of i.i.d. pairs [iifc,y(ufc)], see, e.g., [37], [10]. The 
local influence just mentioned has the consequence that 
an optimal design should (asymptotically) tend to ob- 
serve everywhere in U, and distribute the points with 
a density (i.e. according to a probability measure abso- 
lutely continuous with respect to the Lebesgue measure 
onM — again, we shall see that the situation is oppo- 
site for the parametric case). Few results exist on that 
difficult topic, see e.g. [30]: for u scalar, observations 
y(uk) = f(uk) +£fc with i.i.d. errors Ek, and a prediction 
of the Nadaraya- Watson type ([118], [189]), a sequential 
algorithm is constructed that is asymptotically optimal 
(it tends to distribute the points Uk with a density pro- 
portional to |/"(u)| 2 / 9 ). See also [115], [41] for related 
results. The uniform distribution may turn out to be op- 
timal when considering minimax optimality over a class 
of functions, see [12]. 

When Kriging is used for prediction, the MSE is given 
by (4) and SS can be chosen for instance by minimizing 
the maximum MSE max u6 ^ p 2 -, (u) (which is related to 
minimax-distance design, see [78]) or by minimizing the 
integrated MSE J u Pj,(\i) ir(du), with ir(-) some proba- 
bility density for u, see [156] . Maximum entropy sampling 
[163] provides an elegant alternative design method, usu- 
ally requiring easier computations. It can be related to 
maximin-distance design, see [78] . 

Notice finally that in general the parameters (3, op and 
<7 2 in the covariance matrix C y used in Kriging are esti- 
mated from data, so that the precision of their estimation 
influences the precision of the prediction. This seems to 
have received very little attention, although designs for 
prediction (space filling for instance) are clearly not ap- 
propriate for the precise estimation of these parameters, 
see [197]. 

4 Parametric models and information matrices 

Throughout this section we consider regression models 
with observations 

y(iifc) = 77(0, u fc ) + £ fe , 6 e 6, u fe eU, (5) 

where the errors are independent with zero mean 
and variance IE Ufc (e|) = er 2 (ufc), k = 1,2,... (with 



3 There exists a literature on active learning, which aims at 
selecting training data using techniques from DOE. However, 
it seems that when explicit reference to DOE is made, the 
attention is restricted to learning with a parametric model, 
see in particular [33], [34]. In that case, the underlying as- 
sumption that the data are generated by a process whose 
structure coincides with that of the model is often hardly 
tenable, especially for a behavioral model e.g. of the neural- 
network type, see Section 5.3.4 for a discussion. 
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< a < er 2 (u) < b < oo). The function rj(0, u k ) is 
known, possibly nonlinear in 0, and 0, the true value of 
the model parameters, is unknown. The asymptotic be- 
havior of the LS estimator, in relation with the design, is 
recalled in the next section (precise proofs are generally 
rather technical, and we give conditions on the design 
that facilitate their construction). Maximum-Likelihood 
estimation and estimating functions are considered next. 
The extension to dynamical systems requires more tech- 
nical developments beyond the scope of this paper. One 
can refer e.g. to [58], [104], [24] [171] for a detailed pre- 
sentation, including data-recursive estimation methods. 
Also, one can refer to the monograph [180] for the identi- 
fication of systems with distributed parameters and e.g. 
to [90], [151], [152] for optimal input design for such sys- 
tems. 

4-1 Weighted LS estimation 

Consider the weighted LS (WLS) estimator 

N N 

Owls = argmin(l/iV) ^ w(u k )[y(u k ) - rj(0, u k )] 2 
e fe=i 



continuously differentiable in and that the matrix 

d y(0,u) dr,(0,u) 

d0 |0 d T \t ( j 



Mi(£,0) = J w(u) 
u 



has full rank, an application of the Central Limit The- 
orem to a Taylor series development of \7 g J N (0), the 



gradient of the WLS criterion, around WLS gives 

^N(Cls -0)±z~ AT(p, C(w, £, §)) , N -> ex. , (6) 
where C(w, f , 0) = M" 1 ^, 0)M 2 (£, 0)M- 1 (^ 0) with 



M 2 (£,0) = J w 2 (u)* 2 (u) 



2 / \ 2 / \ 9r/{0,u) dr)(0,u) 



00 80 



u 



One may notice that C(w,£, 0) — M _1 (£, 0) is non- 
negative definite for any weighting function w(-), where 
M(£, 0) denotes the matrix 



M(£,0) 



- 2 (u) dv{ ^ u) dv i e : u h (du) . 
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80' 



(7) 



with w(-) a known function, bounded on hi. To inves- 

tigate the asymptotic properties of 0\yls f° r N ^ oo 
we need to specify how the design points u^'s are gen- 
erated. In that sense, the asymptotic properties of the 
estimator are strongly related to the design. The early 
and now classical reference [77] makes assumptions on 
the finite tail products of the regression function 77 and 
its derivatives, but the results are more easily obtained 
at least in two cases: 

• (i) (ufc) forms a sequence of i.i.d. random variables 
(vectors), distributed with a probability measure £ 
(which we call random design) ; 

• (ii) The empirical measure £jv with distribution func- 
tion IF^^u) = X^iLi u <u(V-^0 (where the inequality 
Ui < u is componentwise) converges strongly (in varia- 
tion, see [164], p. 360) to a discrete probability measure 
£ onW, with finite support SS^ = {u e U : £({u}) > 0}, 
that is, limAr^oo £jv({u}) = £({u}) for any ueU. 

Note that in case (i) the pairs (£fc,Ufc) are i.i.d. and in 
case (ii) there exist a finite number of support points u l 
that receive positive weights £(u 4 ) > 0, so that, as N 
increases, the observations at those u*'s are necessarily 
repeated. In both cases the asymptotic distribution of 
the estimator is characterized by £. 



a N 1± a 



The strong consistency of WLS , 
N — > 00, can easily be proved for designs satisfy- 
ing (i) or (ii) under continuity and boundedness as- 
sumptions on r/(0, u) when the estimability condition 
[f u w(u)[r](0,u) - V (0',u)} 2 t(du) =0^6' = 6} is sat- 
isfied. Supposing, moreover, that r](0, u) is two times 



The equality C(w,£,0) = M _1 (£, 0) is obtained for 
w{u) = c<j~ 2 (u), with c a positive constant, and this 
choice of w(-) is thus optimal (in terms of asymptotic 
variance) among all WLS estimators. This result can be 
compared to that obtained for linear regression in Sec- 
tion 2.1 where a^M^ 1 was the exact expression for the 

variance of for N finite. In nonlinear regression the 

- ~N 

expression C(w,^,0)/N for the variance of is only 
valid asymptotically, see_(6); moreover, it depends on 
the unknown true value of the parameters. These re- 
sults can easily be extended to situations where also the 
variance of the errors depends on the parameters of 
the response n, that is, IE Ufc (e^) = cr 2 (u k ) = f3X(0,u k ), 
see e.g. [130], [140]. 

4-2 Maximum-likelihood estimation 

Denote <fu k (') the probability density function (pdf) 
of the error e k in (5). Due to the independence of er- 
rors, we obtain for the vector y of observation the pdf 

and the Maximum-Likelihood (ML) estimator ML min- 
imizes - log 7r(y|0) = J2k=i -iog^ufc [y(ufc)- r?(0, ii*,)]. 
Different pdf ip yield different estimators (LS for Gaus- 
sian errors, L\ estimation for errors with a Laplace dis- 
tribution, etc.). Under standard regularity assumptions 
on <p u (") an d for designs satisfying conditions (i) or (ii) 



of Section 4.1, 



Vn(6 



ML 



ML 



0)^Z 



and 



■A/-(O,M^(£,0)), AT ^00, (8) 



6 



with Mf(0,£) the Fisher information matrix (average 
per sample) given by 



1 G>log^(y|0)dlog^(y|0) 



N 



89 

1 d 2 log^(y|fl) 
JV «90<90 T 



In the particular case of the regression model considered 
here we obtain 



m F (£,0)= y i(u) 



o?7(e, u) a?7(e, u) 



£(du) 



(9) 



withl(u) = J , [( f 5( 1 (z)] 2 /(p u (2:) ciz the Fisher information 
for location of the pdf (p u . From the Cramer-Rao in- 
equality, M^ 1 ^, 9) forms a lower-bound on the covari- 

~N 

ance matrix of any unbiased estimator 9 of 0, i.e., 

TE 9 {(0 N - 6){0 N - 0) T } - M^ 1 (£, 9) /N is non-negative 

definite for any estimator 9 such that IEg{0 } = 9. 
When the errors e k are normal Af(0, a 2 (u k )), I(u) = 
er~ 2 (u) and ML estimation coincides with WLS with op- 
timal weights (and M F (£, 9) coincides with (7)). When 
they are i.i.d., that is <p u — ip for any u, I(u) = X con- 
stant, and 



M F ({,9)=1 



dr)(0,u) drj(0,u) 



89 



89 



adu) 



(10) 



4-2.1 Estimating functions 

Estimating functions form a very generally applicable 
set of tools for parameter estimation in stochastic mod- 
els. As the example below will illustrate, they can yield 
very simple estimators for dynamical systems. One can 
refer to [63] for a general exposition of the methodology, 
see also the discussion paper [103] that comprises a short 
historical perspective. Instrumental variables methods 
(see, e.g., [169], [170] and Chapter 8 of [171]) used in dy- 
namical systems as an alternative to LS estimation when 
the regressors and errors are correlated (so that the LS 
estimator is biased) can be considered as methods for 
constructing unbiased estimating functions. Their im- 
plementation often involves the construction of regres- 
sors obtained through simulations with previous values 
of parameter estimates, but simpler constructions are 
possible. 

Consider a discrete-time system with scalar state and 
input, respectively Xi and m, defined by the recurrence 
equation 



x i+ i = Xi + T[m 



1)], t = 0,l,2.. 



(11) 



with known sampling period T and initial state Xq. The 
observations are given by = Xi+Eifori > 1, where (si) 



denotes a sequence of i.i.d. errors normal 7V(0, a 2 ). The 
unknown parameter 9 can be estimated by LS (which 
corresponds to ML estimation since the errors are nor- 
mal) , but recursive LS cannot be used since Xi depends 
nonlinear ly in 9. However, simpler estimators can be 
used if one is prepared to loose some precision for the 
estimation. For instance, substitute y{ for the state Xi in 
(11) and form the equation in 9 



9i+i{0) = V%+i - Hi- T[m + 9{yi + 1)] = : 



(12) 



k successive observations then give G k (9) — ^ Yli=i 9i W 
= 0. Since G k {9) is linear in the yt's, TEg{G k (9)}' = 
for any 9, and G k {9) is called an unbiased estimating 
function^, see, e.g., [103]. Since G k (9) is linear in 9, the 
solution 9 k of G k {9) = is simply given by 



(Vk - ya)/(kT) - QCLo u i)l k 

i + (£-=oWfc 



(13) 



(provided that the denominator is different from zero) 
and forms an estimator for 9. Notice that the true value 
9 satisfies a similar equation with the j/j's replaced by 
the noise- free values a;,. Estimation by 9 k is less pre- 
cise than LS estimation, see Figure 3 in Section 9, but 
requires much less computations. Would other parame- 
ters be present in the model, other estimating functions 
would be required. For instance, a function of the type 

G k ,u{9) = Sj=i * Q .9«(^) would put more stress on the 
transient (respectively long-term) behavior of the sys- 
tem when a < (respectively a > 0). Also, the multipli- 
cation oigi{9) by a known function of Ui gives a new esti- 
mating function. When information on the noise statis- 
tics is available, it is desirable for the (asymptotic) preci- 
sion of the estimation to choose G k as (proportional to) 
an approximation of the score function 8 log it (y\9) / 89 
with 7r(y|0) the pdf of the observations yi, . . . , y k , see, 
e.g., [36] p. 274 and [103]. 

There seems to be a revival of interest for estimating 
functions, partly due to the elegant algebraic framework 
recently developed for time-continuous linear systems 
(differential equations); see [47] where estimating func- 
tions are constructed through Laplace transforms. How- 
ever, in this algebraic setting only multiplications by s or 
and differentiation with respect to s are considered 
(with s the Laplace variable), which seems unnecessar- 
ily restrictive. Consider for instance the time-continuous 
version of (11), 



x = u + 9{x + 1) , x(0) — xq , 



(14) 



4 Nonlinearity in the observations is allowed, provided that 
the bias is suitably corrected; for instance the function 
g' i+1 (6) = {l + yi)g i+1 (e) + a 2 (l + Te) with g i+1 (fi) given by 
(12) satisfies TEe{g' i+ i (6)} = for any 6 when the errors Si are 
i.i.d. with zero mean and variance a 2 , and (1/fc) ffiW 
is an unbiased estimating function for 9. 
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where x denotes differentiation with respect to time. Its 
Laplace transform is sX(s) = U(s) + 9X(s) + s~ 1 9 + x , 
which can be first multiplied by s, then differentiated 
two times with respect to s and the result multiplied by 
s~ 2 to avoid derivation with respect to time. This gives a 
estimating function comprising double integrations with 
respect to time. Multiple integrations may be avoided by 
noticing that the multiplication of the initial differential 
equation by any function of time preserves the linearity 
of the estimating function with respect to both 9 and 
the state (provided that the integrals involved are well 
defined). For instance, when -it is a known function of 
time, the multiplication of (14) by the input u followed 
by integration with respect to time gives the estimat- 
ing function [x(t)u(t) — x u ]/t — (1/t) Jq[x(t)u(t) + 

u 2 (t)]cIt — (8/t) J q u(t)[1 + x(r)]dr : which is linear 
in x. Infinitely many unbiased estimating functions can 
thus be easily constructed in this way. (Note that, due 
to linearity, the introduction of process noise in (14) as 
x(t) — u(t)+9[x(t) + l]+dB t (u>), with B t (uj) aBrownian 
motion, leaves the estimating function above unbiased.) 

The analysis of the asymptotic behavior of the estimator 

-k 

6 associated with an estimating function is straightfor- 
ward when the function is unbiased and linear in 9. The 
expression of the asymptotic variance of the estimator 
can be used to select suitable experiments in terms of 
the precision of the estimation, as it is the case for LS 
or ML estimation. However, in general the asymptotic 
variance of the estimator takes a more complicated form 
than M" 1 ^) or M^fofl), see (7, 10), so that DOE 
for such estimators does not seem to have been consid- 
ered so far. The recent revival of interest for this method 
might provide some motivation for such developments 
(see also Section 9). 



$[Mf(^, 9)], for some criterion function <&(•). For mod- 
els nonlinear in 9, this raises two difficulties: (i) the cri- 
terion function, and thus £*, depends on a guessed value 
9 for 9. This is called local DOE (the design £* is opti- 
mal locally, when 9 is close to 9), some alternatives to 
local optimal design will be presented in Section 5.3.5; 
(ii) the method relies on the asymptotic properties of the 
estimator. More accurate approximations of the preci- 
sion of the estimation exist, see e.g. [126], but are com- 
plicated and seldom used for DOE, see [128], [138] (see 
also the recent work [25] concerning the finite sample 
size properties of estimators, which raises challenging 
DOE issues). They will not be considered here. For dy- 
namical systems with correlated observations or contain- 
ing an autoregressive part, classical DOE also relies on 
the information matrix, which has then a more compli- 
cated expression, see Section 6. Also, the calculation of 
the asymptotic covariance of some estimators requires 
specific developments that are not presented here, see 
e.g. [58], [104], [24] for recursive estimation methods. 
For Bayesian estimation, a standard approach for DOE 
consists in replacing M F (£,0) by M F (£, 9) + fl^ 1 /N, 
with the prior covariance matrix for 9, see e.g. [132], 
[27]. Note finally the central role of the design concern- 
ing the asymptotic properties of estimators. In partic- 
ular, the conditions (i) and (ii) of Section 4.1 on the 
design imply some stationarity of the "inputs" and 
guarantee the persistence of excitation, which can be ex- 
pressed as a condition on the minimum eigenvalue of the 
information matrix: limmfjv— ¥00 AminfMir^jV) 9)] > 0, 
with £/v the empirical measure of Ui, . . . , ujv (that is, 
liminfAr^oo A m i n (Mjv)/./V > for the linear regression 
model of Section 2.1, see (2)). 



5 DOE for parameter estimation 



.3 DOE 



5.1 Design criteria 



To obtain a precise estimation of 9 one should first use 
a good estimator (WLS with weights proportional to 
er~ 2 , or ML) and second select a good desigrFI £*. In 
the next section we shall consider classical DOE for pa- 
rameter estimation, which is based on the information 
matrix (10j3- Hence, we shall choose £* that optimizes 



5 We shall thus follow the standard approach, in which the 
estimator is chosen first, and an optimal design is then con- 
structed for that given estimator (even though it may be 
optimal for different estimators); this can be justified under 
rather general conditions, see [119]. 

6 Note that defining rj(0, u) = a^ 1 (u)w(d, u) and fj(d, u) = 
■\/I(u)t](0, u) one can respectively write the matrices (7) 
and (9) in the same form as (10). Also notice that classi- 
cal DOE uses the covariance matrix with the simplest ex- 
pression: DOE for WLS estimation is more complicated for 
non-optimal weights than for the optimal ones, compare 
C(w, £, 9) to M _1 (£, 6) in Section 4.1. Similarly, the asymp- 
totic covariance matrix for a general M-estimator (see, e.g., 



We consider criteria for designing optimal experiments 
(for parameter estimation) that are scalar functions of 
the (Fisher) information matrix (average, per sample) 
(10j_]- For N observations at the design points Uj e U, 
i = 1, . . . , N, we shall denote = (ui, . . . , ujy), which 
is called a finite (or discrete) design of size N , or N -point 
design. The associated information matrix is then 



M F (U? 



N 



' N 



di](9,Ui) dr)(8,Ui) 



08 



89 



(15) 



[72]) is more complicated than for ML. 

7 Notice that the analytic form of the sensitivities 
dn{6 ,u) / d6 of the model response is not required: for a 
model given by differential equations, like in Section 2.2, or 
by difference equations, the sensitivities can be obtained by 
simulation, together with the model response itself; see, e.g., 
Chapter 4 of [188]. 
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The admissible design set U is sometimes a finite set, 
U = {u 1 , . . . , u }, K < oo. We shall more generally 
assume that U is a compact subset of R d . For a linear 
regression model with i.i.d. errors 7V(0, cr 2 ), the ellip- 
soid^^,") = {0/ (0-^ s ) T M F (C7f )(0-0f s ) < 
X^(p) j 'N} , where X^(p) has the probability a to be ex- 
ceeded by a random variable chi-square distributed with 

p degrees of freedom, satisfies Pr{0 £ 1Z(9 LS ,a)} = a, 
and this is asymptotically true in nonlinear situation^" 5 ! 



Most of classical design criteria are related to character- 
istics of (asymptotic) confidence ellipsoids. Minimizing 
$(M) = trace[M ] corresponds to minimizing the sum 
of the squared lengthes of the axes of (asymptotic) con- 
fidence ellipsoids for 9 and is called A-optimal design 
(minimizing <1>(M) = trace[Q T QM~ 1 ] with Q some 
weighting matrix is called L-optimal design, see [31] 
for an early reference). Minimizing the longest axis of 
(asymptotic) confidence ellipsoids for 9 is equivalent to 
maximizing the minimum eigenvalue of M and is called 
-E-optimal design. D-optimal design maximizes det(M), 
or equivalently minimizes the volume of (asymptotic) 
confidence ellipsoids for 9 (their volume being pro- 
portional to 1/vdctM). This approach is very much 
used, in particular due to the invariance of a D-optimal 
experiment by re-parametrization of the model (since 
detM(£,0') = detM(£,0)[det(<907<90 T )]- 2 ). Most of- 
ten Z)-optimal experiments consist of replications of a 
small number of different experimental conditions. This 
has been illustrated by the example of Section 2.2 for 
which p = 4 and four sampling times were duplicated in 
the D-optimal design t*. 



5.2 Algorithms for discrete design 



Consider the regression model (5) with i.i.d. errors and 
N observations at = (ui, . . . , uy) where the sup- 
port points Ui belong to U C R d . The Fisher informa- 
tion matrix Mj?([/f , 9) is then given by (15). The (lo- 
cal) design problem consists in optimizing ^>g(Ui) — 
^\M F (Uf,e)] for a given 9, with respect to C/f <E 
R . If the problem dimension N x d is not too large, 
standard optimization algorithms can be used (note, 
however, that constraints may exist in the definition of 
the admissible set U and that local optima exist is gen- 
eral). When N x d is large, specific algorithms are rec- 
ommended. They are usually of the exchange type, see 
[42], [108]. Since several local optima exist in general, 
these methods provide locally optimal solutions only. 



Such confidence regions for can be transformed into 
simultaneous confidence regions for functions of 6, see in 
particular [160], [14]. 



5.3 Approximate design theory 
5.3.1 Design measures 

Suppose that replications of observations exist, so that 
several Uj's coincide in (15). Let m < N denote the 
number of different Uj's, so that 



M F (y?,9 



dr)(d,Ui) drj(9,u t ) 



i=i 



N 89 



d9 T 



with ri/N the proportion of observations collected at u^, 
which can be considered as the percentage of experimen- 
tal effort at u^, or the weight of the support point u^. De- 
note A(uj) this weight. The design is then character- 
ized by the support points Ui , . . . , u m and their associ- 
ated weights A(ui), . . . , A(u m ) satisfying Y^T=i K u i) — 
1, that is, a normalized discrete distribution on the Uj's, 
with the constraints A(iij) = ri/N, i = 1, . . . , m. Releas- 
ing these constraints, one defines an approximate design 
as a discrete probability measure with support points 
and weights Aj (with YliLi Aj = 1)- Releasing now the 
discreteness constraint, a design measure is simply de- 
fined as any probability measure ^ onW, see [84], and 
M F (£, 9) takes the form (10). Now, M F (£, 9) belongs to 
the convex hull of the set A4 1 of rank-one matrices of the 
form M(<5 U , 9) = I [dr](9, u)/89] [dr](9, u)/89 T }. It is a 
pxp symmetric matrix, and thus belongs to a p(p+ 1) /2- 
dimensional space. Therefore, from Caratheodory's The- 
orem, it can be written as the linear combination of 
p(p + l)/2 + 1 elements of Aii at most; that is 



M 



dr)(6,Ui) dn(9,Ui) 



80 



89 1 



(16) 



with m < p(p + l)/2 + 1. The information matrix as- 
sociated with any design measure £ can thus always be 
considered as obtained from a discrete probability mea- 
sure with p(p + l)/2 + 1 support points at most. This is 
true in particular for the optimal desigrPI. Given such 
a discrete design measure £ with m support points, a 
discrete design U f v with repetitions can be obtained by 
choosing the numb ers of repetitions rj such that ri/N 
is an approximation^^] of Ai, the weight of u 4 for £, see, 
e.g., [150]. 

The property that the matrices in the sum (16) have 
rank one is not fundamental here and is only due 



9 In general the situation is even more favorable. For in- 
stance, if £d is D-optimal (it maximizes det Mf(£, 0)), then 
Mf(£d, 0) is on the boundary of the convex closure of Aii 
and p(p + 1) /2 support points are enough. 

10 This is at the origin of the name approximate design theory. 
However, a design £ (even with a density) can sometimes 
be implemented without any approximation: this is the case 
in Section 6.2 where £ corresponds to the power spectral 
density of the input signal. 
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to the fact that we considered single-output mod- 
els (i.e., scalar observations). In the multiple-output 
case with independent errors, say with y(u) of di- 
mension q corrupted by errors having the q x q co- 
variance matrix E(u), the model response is a q- 
dimcnsional vector r)(0, u) and the information matrix 
for WLS estimation with weights S _1 (u) is M(£, 0) = 
f u [dr 1 T (O,u)/d0}X- 1 (u)[dr 1 (0,u)/dO T }Z(du), to be 
compared with (7) obtained in the single-output case, 
see, e.g., [42], Section 1.7 and Chapter 5. Caratheodory's 
Theorem still applies and, with the same notations as 
above, we can write 



dr)(0,Uj) 
d0 T 



with again m < p(p+ 1)/2 + 1. All the results concerning 
DOE for scalar observations thus easily generalize to the 
multiple-output situation. 

5.3.2 Properties 

Only the main properties are indicated, one may refer 
to the books [42], [167], [125], [4], [149], [44] for more 
detailed developments. Suppose that the design crite- 
rion $[M] to be minimized (respectively maximized) is 
strictly convex (respectively concave). For instance for 
D-optimality, maximizing det[M] is equivalent to maxi- 
mizing logdet[M] and, for any positive-definite matrices 
Mi,M 2 such that Mi ^ M 2 ,Va, < a < l,logdet[(l- 
a)Mi + aM 2 ] > (1 - a) log det[Mi] + a logdet[M 2 ], so 
that <&[•] = logdet[-] is a strictly concave function. Since 
M.p(£,0) belongs to a convex set, the optimal matrix 
= Mi? (£* , 0) for $ is unique (which usually does not 
imply that the optimal design £* is unique; however, the 
set of optimal design measures is convex). The unique- 
ness of the optimum and differentiability of the criterion 
directly yield a necessary and sufficient condition for op- 
timality, and in the case of D-optimality we obtain the 
following, known as Kiefer- Wolfowitz Equivalence Theo- 
rem [85] (other equivalence theorems are easily obtained 
for other design criteria having suitable regularity and 
the appropriate convexity or concavity property). 

Theorem 1 The following statements are equivalent: 

(1) £d is D-optimal for 0, 

(2) max ueW d (u,^£>) = p, 

(3) £,d minimizes max ue ^ dg(u, £d), 
where dg (u, £) is defined by 



(17) 



Theorem 1 relates optimality in the parameter space to 
optimality in the space of observations, in the following 



sense. Let ML be obtained for a design £, the variance 



of the prediction i](0 ML ,u) of the response at u is then 



such that Nva,r[n(0 ML , u)] tends to 



dy(0,u) 
d0 T \0 



M^(£,0) 



dy(0,u) _ d s (u,Q 
80 \0 1 



(18) 



Moreover, for any support point Uj of^o, dg(ui, £,d) = P- 



when N — > oo, see (8). Therefore, a D-optimal experi- 
ment also minimizes the maximum of the (asymptotic) 
variance of the prediction over the experimental domain 
U. This is called G-optimality, and Theorem 1 thus ex- 
presses the equivalence between D and G-optimality. (It 
is also related to maximum entropy sampling considered 
in Section 3.2.2, see [193].) 

Suppose that the observations are collected sequentially 
and that the choice of the design points can be made 
accordingly {sequential design). After the collection of 
y(ui), . . . , y(uAr), which gives the parameter estimates 

-AT -N 

and the prediction n (0 , u), in order to improve the 
precision of the prediction the next observation should 

-N 

intuitively be placed where var[?7(0 , u)] is large, that is, 
where dg N (u, £jv) is large, with £jv the empirical measure 
for the first N design points. This receives a theoretical 
justification in the algorithms presented below. 

5.3.3 Algorithms 

The presentation is for D-optimality, but most al- 
gorithms easily generalize to other criteria. Let £ fc 
denote the design measure at iteration k of the al- 
gorithm. The steepest-ascent direction at £ fc corre- 
sponds to the delta measure that puts mass 1 at 
u k+1 = argmax ug ^ dg(u, £ fc ). Hence, at iteration k, 
algorithms of the steepest-ascent type add the support 
point u£ to £ fe as follows: 

Fedorov Wynn Algorithm: 

• Step 1 : Choose not degenerate (det Mp^ 1 , 0) ^ 0), 
and e such that < e << 1, set k = 1. 

• Step 2 : Compute u£ +1 = argmax ue ^ dg(u, £ fe ). If 
dg(ul +1 , £ fe ) < p + e, stop: £ fe is almost D-optimal. 

• Step 3 : Set £ fe+1 = (1 - a k )£ k + a fc <f u ; +1 , k -» k + 1, 
return to Step 2. 

Fcdorov's algorithm corresponds to choosing the step- 
length a k that maximizes log det Mf(^ +1 , 0), which 
gives a* k = [dg{u* k+1 ^ k )-p]/{p[d e {u* k+1 ^ k )-l]} (note 
that < a* k < l/p) and ensures monotonic convergence 
towards a D-optimal measure £d , sec [42] . 



Note that condition (2) is easily checked when u is scalar 
by plotting dg(u, £) as a function of u. 



Wynn's algorithm corresponds to a sequence satisfying 
< a k < 1, linife^oo a k = and J^iLi a k = oo, see 
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[192] (the convergence is then not monotonic). One may 
notice that in sequential design where the design points 
enter Mjr([/ 1 JV , 8) given by (15) one at a time, one has 



M F (U* +1 ,0) = 



k 



M F (U$,6) 

dr)(0,u k+ x) drj(8,u k+1 ) 



08 



os T 



and, when u^+i = argmax ue ^ <ig(u,£ fc ), this corre- 
sponds to Wynn's algorithm with a k = l/(k + 1). 

Contrary to the exchange algorithms of Section 5.2, these 
steepest-ascent methods guarantee convergence to the 
optimum. However, in practice they are rather slow (in 
particular due to the fact that a support point present 
at iteration k is never totally removed in subsequent it- 
erations — since a k < 1 for any k) and faster methods, 
still of the steepest-ascent type, have been proposed, see 
e.g. [13], [111], [112] and [44] p. 49. An acceleration of 
the algorithms can also be obtained by using a submod- 
ularity property of the design criterion, see [154], or by 
removing design points that cannot support a D-optimal 
design measure, see [61]. 

When the set IA is finite (which can be obtained by a 
suitable discretization), say with cardinality K, the op- 
timal design problem in the approximate design frame- 
work corresponds to the minimization of a convex func- 
tion of K positive weights A^ with sum equal one, and any 
convex optimization algorithm can be used. The recent 
progress in interior point methods, see for instance the 
survey [48] and the books [120], [40], [190], [194], provide 
alternatives to the usual sequential quadratic program- 
ming algorithm. In control theory these methods have 
lead to the development of tools based on linear matrix 
inequalities, see, e.g., [20], which in turn have been sug- 
gested for D-optimal design, see [182] and Chapter 7 of 
[21] . Alternatively, a simple updating rule can sometimes 
be used for the optimization of a design criterion over a 
finite set IA = {u 1 , . . . , u K }. For instance, convergence 
to a Z)-optimal measure is guaranteed when the weight 
A^ of u l at iteration k is updated as 



ifc+1 



= a: 



,d e (u\e) 



(19) 



where £ k is the measure defined by the support points u l 
and their associated weights Af , and de(u, £) is given by 
(17), see [176], [168], [177] and Chapter 5 of [125]. (Note 
that J2? =1 Af +1 = 1 and that Af +1 > when A* > 0.) 
The extension to the case where information matrices 
associated with single points have ranks larger than one 
(see Section 5.3.1) is considered in [180]. 

Finally, it is worthwhile noticing that D-optimal design 
is connected with a minimum-ellipsoid problem. Indeed, 



using Lagrangian theory one can easily show that the 
construction of £d that maximizes the determinant of 
Mf(^,6) given by (10) with respect to the probability 
measure £ on IA is equivalent to the construction of the 
minimum-volume ellipsoid, centered at the origin, that 
contains the set SS e = {0ij(8,u)/08 : u e U) C W, 
see [165]. The construction of the minimum- volume el- 
lipsoid centered at containing a given set UcP thus 
corresponds to a D-optimal design problem on IA for the 
linear regression model rj(8, u) = u T 9. In the case where 
the center of the ellipsoid is free, one can show equiva- 
lence with a Z3-optimal design in a (p + l)-dimcnsional 
space where the regression model is i](8, u) = (1 u T )8, 
8 G M. p+1 , see [166], [175]. Algorithms with iterations of 
the type (19) are then strongly connected with steepest- 
descent type algorithms when minimizing a quadratic 
function, see [147], [148] and Chapter 7 of [146]. In sys- 
tem identification, minimum-volume ellipsoids find ap- 
plications in parameter bounding (or parameter estima- 
tion with bounded errors), see, e.g., [145] and [153] for 
an application to robust control. 

5.3-4 Active learning with parametric models 

When learning with a parametric model, the predic- 

tion yx>(u) at u is r/(8 , u) with 8 estimated from 
the data V = {[ui, y(ui)], . . . , [ujy, y(ujy)]}. As Theo- 
rem 1 shows, the precision of the prediction is directly 
related to the precision of the estimation of the model 
parameters 6: a D-optimal desi gn minimizes the max- 
imum (asymptotic) varianc 

eE3 of fo>(u) for u G IA. 
Similar properties hold for other measures of the pre- 
cision of the prediction. Consider for instance the inte- 
grated (asymptotic) variance of the prediction with re- 
spect to some given probability measure ir (that may 
express the importance given to different values of u in 
U). It is given by *e, H (0 = tra ce {HM" 1 ^ 8)}, where 
H = H(0) = f u [dr][e,u)/ae] [0r 1 (d,u)/08 T }ir(du), see 
(18), and its minimization corresponds to a L-optimal 
design problem, see Section 5.1. The following paramet- 
ric learning problem is addressed in [81]: the measure 
7r is unknown, n samples from it are used, together 
with the associated observations, to estimate 6 and H, 
respectively by 8 and H n (8 ) , N — n samples are then 
chosen optimally for jj„ (£). It is shown that the op- 
timal balance between the two sample sizes corresponds 
to n being proportional to y/N. When the samples 
are cheap and only the observations y(iij) are expensive, 
one may decide on-line to collect an observation or not 
for updating the estimate 8 and the information ma- 
trix M„. A sequential selection rule is proposed in [136], 



11 We could also speak of MSE since in parametric models 
the estimators are usually unbiased for models linear in 0, 
and for nonlinear models (under the condition of persistence 
of excitation) the squared bias decreases as 1/N 2 whereas 
the variance decreases as 1/N, see [19]. 
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which is asymptotically optimal when a given propor- 
tion n = [aN\ of samples, a £ (0, 1), can be accepted 
in a sequence of length N, N — ► oo. 

There exists a fundamental difference between learning 
with parametric and nonparametric models. For para- 
metric models, the MSE of the prediction globally de- 
creases as 1/N, and precise predictions are obtained for 
optimal designs which, from Caratheodory's Theorem 
(see Section 5.3.1) are concentrated on a finite number 
of sites. These are the points that carry the maxi- 
mum information about 6 useful for prediction, in terms 
of the selected design criterion. On the opposite, precise 
predictions for nonparametric models are obtained when 
the observation sites are spread over U, see Section 3.2.2. 
Note, however, that parametric methods rely on the ex- 
tremely strong assumption that the data are generated by 
a model with known structure. Since optimal designs will 
tend to repeat observations at the same sites (whatever 
the method used for their construction) , modelling errors 
will not be detected. This makes optimal design theory 
of very delicate use when the model is of the behavioral 
type, e.g. a neural network as in [33], [34]. A recent ap- 
proach [52] based on bagging (Bootstrap Aggregating, 
see [23]) seems to open promising perspectives. 

5.3.5 Dependence in in nonlinear situations 

We already stressed the point that in nonlinear situa- 
tions the Fisher information matrix depends on 0, so 
that an optimal design for estimation depends on the 
unknown value of the parameters to be estimated. So 
far, only local optimal design has been considered, where 
the experiment is designed for a nominal value 6. Sev- 
eral methods can be used to reduce the effect of the de- 
pendence in the assumed 9. A first simple approach is 
to use a finite set O = {0^\ . . . , O^} of nominal val- 
ues and to design m locally optimal experiments for 

the in O. This permits to appreciate the strength 

of the dependence of the optimal experiment in 0, and 
several 's can eventually be combined to form a sin- 
gle experiment. More sophisticated approaches rely on 
average or minimax optimality. 

In average- optimal design, the criterion *f?e(0 = 
$[Mf (£,#)] is replaced by its expectation JE^^siO} = 
J* $[Mf(£, 6)} ir(dd) for some suitably chosen prior n, 
see, e.g., [43], [26], [27]. (Note that when the Fisher 
information matrix Mp((, 6) is used, it means that the 
prior is not used for estimation and the method is not 
really Bayesian.) In minimax- optimal design, ^g{^) (to 
be minimized) is replaced by its worst possible value 
max0 e0 $[Mjr(5, 9)} when 6 belongs to a given feasible 
set 0, see, e.g., [43]. Compared to local design, these ap- 
proaches do not create any special difficulty (other than 
heavier computations) for discrete design, see Section 
5.2: no special property of the design criterion is used, 
but the algorithms only yield local optima. Of course, 



for computational reasons the situation is simpler when 
7r is a discrete measure and is a finite setR Con- 
cerning approximate design theory (Section 5.3), the 
convexity (or concavity) of <f> is preserved, Equivalence 
Theorems can still be obtained (Section 5.3.2) and glob- 
ally convergent algorithms can be constructed (Section 
5.3.3), see, e.g., [44]. A noticeable difference with local 
design, however, concerns the number of support points 
of the optimum design which is no longer bounded by 
p(p + l)/2 + 1 (see, e.g., Appendix A in [155]). Also, 
algorithms for minimax-optimal design are more com- 
plicated than for local optimal design, in particular 
since the steepest-ascent direction does not necessarily 
correspond to a one-point delta measure. 

A third possible approach to circumvent the dependence 
in 9 consists in designing the experiment sequentially 
(see the examples in Sections 2.3 and 2.4), which is par- 
ticularly well suited for nonparametric models, both in 
terms of prediction and estimation of the model, see Sec- 
tion 3.2.2. Sequential DOE for regression models is con- 
sidered into more details in Section 8. 

6 Control in DOE: optimal inputs for parameter 
estimation in dynamical models 

In this section, the choice of the input is (part of) the 
design, or £ depending whether discrete or approx- 
imate design is used. One can refer in particular to the 
book [196] and Chapter 6 of [58] for detailed develop- 
ments. The presentation is for single- input single-output 
systems, but the results can be extended to multi-input 
multi-output systems. The attention is on the construc- 
tion of the Fisher information matrix, the inverse of 
which corresponds to the asymptotic covariance of the 
ML estimator, see Section 4. For control-oriented appli- 
cations it is important to relate the experimental design 
criterion to the ultimate control objective, see, e.g., [50], 
[53]. This is considered in Section 6.2. 

6.1 Input design in the time domain 

Consider a Box and Jenkins model, with observations 

y k = F(0, z)u k + G{8, z)e k 

where the errors e k are i.i.d. 7V"(0, a 2 ), and F{9, z) and 
G{B, z) are rational fractions in with G stable with a 
stable inverse. Suppose that tr 2 is unknown. An extended 
vector of parameters (3 = (6 T <r 2 ) T must then be esti- 
mated, and one can assume that G(6, oo) = 1 without 
any loss of generality. For suitable input sequences (such 



When O is a compact set of R p , a relaxation algorithm 
is suggested in [143] for minimax-optimal design; stochastic 
approximation can be used for average-optimal design, see 
[142]. 
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that the experiment is informative enough, see [104], p. 

361), NVax($ ML ) -> (f, /3), AT -> oo, with /3 the 
unknown true value of f3 and 



1 51o g7 r(y|/3)91og^(y|/3) 



iV 



9/3 



d(3 [ 



Using the independence and normality of the errors and 
the fact that cr 2 does not depend on 9, we obtain 



M F (£,/3) 



with M F (£,0) = TE 



M F (£,0) 

iT 1 







Na 2 



N 

£ 



de k (9) de k {9 
89 




and efc(0) the prediction error e^ifi) — G~ 1 (9, z)[y k — 
F(9, z)uk] - The fact that a 2 is unknown has therefore no 
influence on the (asymptotic) precision of the estimation 
of 9. Assuming that the identification is p erformed in 
open loop (that is, there is no feedback l 13 1 and that F 
and G have no common parameters (that is, 9 can be 
partitioned into 9 = (9 F 0g) T , withp F components in 
F and pa in 9q), we then obtain 



M F (£,0) 




with 



1 N 

k=l 



o 



x , dF(9,z) 

OU p 



■ x , dF(9,z) 



(20) 



and M F (£, 9) not depending on see, e.g., [58], p. 

131. The asymptotic covariance matrix M i ^ 1 (^, 9) is thus 
partitioned into two blocks, and the input sequence (u k ) 
has no effect on the precision of the estimation of the 
parameters 9q in G. A D-optimal input sequence maxi- 
mizes det M£(£, 9) = det \l/{Na 2 ) £)f =1 v^v^j where 
Vfc is a vector of (linearly) filtered inputs, 



v fc = G- 1 (9,z) 



0F(6, z) 
09 1 



■ Uk , 



(21) 



usually with power or amplitude constraints on u k ■ This 
corresponds to an optimal control problem in the time 
domain and standard techniques from control theory can 
be used for its solution. 



13 One may refer, e.g., to Chapter 6 of [58], [56], [57], [67], 
[49], [50] [76] for results concerning closed-loop experiments. 



6.2 Input design in the frequency domain 

We consider the same framework as in previous 
section, with the information matrix of interest 
M.p(£,9) given by (20). Suppose that the system out- 
put is uniformly sampled at period T and denote 
Mp(£,9) = lira N ^ 00 M 1 p(^9)/T the average Fisher 
information matrix per time unit. It can be written 
as Mp(£,9) = l/(2na 2 ) jZ n V v (uj)dw with V v {oj) 
the power spectral density of given by (21), or 

Mp(£,9) = 1/tt Mp(uj,9)V u (uj)duj with V u {lo) the 
power spectral density of u and 



x G- 1 {9,e~ 3u> ) 



dF{9,e 
39l 



3U\ 



The framework is thus the the same as for approximate 
design theory of Section 5.3: the experimental domain 
IA becomes the frequency domain R + and to the design 
measure £ corresponds the power spectral density V u . 
An optimal input with discrete spectr um a lways exists; 
it has a finite number of support pointj^l (frequencies) 
and associated weights (input power). The optimal in- 
put can thus be searched in the class of signals consist- 
ing of finite combinations of sinusoidal components, and 
the algorithms for its construction are identical to those 
of Section 5.3.3. Notice, however, that no approximation 
is now involved in the implementation of the "approxi- 
mate" design. Once an optimal spectrum has been spec- 
ified, the construction of signal with this spectrum can 
obey practical considerations, for instance on the ampli- 
tude of the signal, see [9] . Alternatively, the input spec- 
trum can be decomposed on a suitable basis of rational 
transfer functions and the optimization of V u performed 
with respect to the linear coefficients of the decomposi- 
tion, see [74] , [75] . Notice that the problem can also be 
taken the other way round: one may wish to minimize 
the input power subject to a constraint on the precision 
of the estimation, expressed through M i ^ 1 (^, 9), see e.g., 
[15], [16]. 

The design criteria presented in Section 5.1 are related 
to the definition of confidence regions, or uncertainty 
sets, for the model parameters. When the intended 
application of the identification is the control of a dy- 
namical system, it seems advisable to relate the DOE to 
control- oriented uncertainty sets, see in particular [53] 
for an inspired exposition. First note that according to 
the expression (18) the variance of the transfer function 
F(9, e 3 ") at the frequency u is approximately Vf{u) = 
{\/N)[dF{9,e^)/d9 T F ] [M.^,9)}- 1 [dF(9,e^)/d9 F ]. 



One can show that the upper bound on their number can 
be reduced from pf{pf + l)/2 to pf, the number of param- 
eters in F, see [58], p. 138. 
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Several iJoo-rclatcd design criteria can then be de- 
rived. For instance, a robust-control constraint of 
the form \\W{e ju )AF(0, e ju )/F(e, e^)^ < 1, with 
AF(0, e i<JJ )/F{6, e jw ) the relative error on F(0,e jw ) 
due to the estimation of and W(e : ' LU ) a weighting func- 
tion, leads to z T (6>,e^)[M|;(C,6>)]- 1 z(6>,e^) < 1 Vw, 

with z(0,e*") = (l/VN)\W(e> u )\[dF(0,e> u )/dO F ]. 
This type of constraint can be expressed as a linear ma- 
trix inequality in Mj£(£, 0), and, using the KYP lemma, 
the problem can be reformulated as having a finite 
number of constraints, see [75]. Notice that minimizing 
max a ,z T (0,e' a ')[M£ (£, 0)1-^(0, e*") can be compared 
to _E-optimum design, see Section 5.1, which minimizes 
max {z:z T z=1} z T [Mf(^,0)]- 1 z. When \W(e? u )\ = 1 
(uniform weighting) and G(6, e 3 ") = 1 (white noise), it 
corresponds to G-optimal design, and thus to D-optimal 
design, see Section 5.3.2. It is also strongly related to 
the minimax-optimal design of Section 5.3.5, (where the 
worst-case is now considered with respect to w), see [44] 
and [143] for algorithms. Alternatively, the asymptotic 
confidence regions for can be transformed into uncer- 
tainty sets SSf(9, £) for the transfer function F(9, e 7 "). 
The worst-case J^-gap over this set can then be com- 
puted, with the property that the smaller this number, 
the larger the set of controllers that stabilize all transfer 
functions in SSf(0,£) [54], [55] (see also [153] for re- 
lated results). Designing experiments that minimize the 
worst-case f-gap is considered in [64] where the problem 
is shown to be amenable to convex optimization. 

The dependence of the design criteria in the unknown pa- 
rameters of the model is a major issue for optimal input 
design, as it is more generally the case for models with a 
nonlinear parametrization (it explains why input spec- 
tra with few sinusoidal components are often considered 
as unpleasant). The methods suggested in Section 5.3.5 
to face this difficulty can be applied here too. In particu- 
lar, input spectra having a small number of components 
can be avoided by designing optimal inputs for different 
nominal values for and combining the optimal spec- 
tra that are obtained, or by using average or minimax- 
optimal design [155]. One can also design the experiment 
sequentially (see Section 8); in general, each design step 
involves many observations and a few steps only are re- 
quired to achieve suitable performance, see, e.g. [8]. 

When on-line adaptation is possible, adjusting the con- 
troller while data are collected and the uncertainty on 
the model decreases can be expected to achieve better 
performance than non-adaptive robust control. Ideally, 
one would wish to have uncertainty sets shrinking to- 
wards a single point representing the true model (or the 
model closest to the true system for the model class con- 
sidered), so that a robust controller adapted to smaller 
and smaller uncertainty sets becomes less and less con- 
servative. While the determination of such robust-and- 
adaptive controllers is still an open issue, a first step in 
the construction is to investigate the properties of the 



parameter estimates in adaptive procedures. 
7 DOE in adaptive control 

The results of Sections 5 and 6 rely on the asymptotic 
properties of the estimator: the asymptotic variance of 

9 ML was supposed to be given by M^/N, which is true 
when the design (input) sequence satisfies some "sta- 
tionarity" condition (the assumption of random design 
was used in Section 5 and a condition of persistence of 
excitation in Section 6). However, this condition may fail 
to hold: a typical example is adaptive control, where the 
input has another objective than estimation. The issues 
that it raises are investigated hereafter. We first present 
a series of simple examples that illustrate the variety of 
the difficulties. 

7. 1 Examples of difficulties 

It is rather well-known that the usual asymptotic nor- 
mality of the LS estimator may fail to hold for de- 
signs such that NIf{U\) is nonsingular for any N 
but converges to a singular matrix, that is, such that 
A min [M F (C/ 1 Ar )] -> as N -> oo, see [131]. We shall not 
develop this point but rather focuss on the difficulties 
raised by the sequential construction of the design. 

Consider the following well-known example (see, e.g., 
[96] , [99] ) of a linear regression model with observations 
Vk = #i+02Ufc+£fc where the errors £fc arei.i.d. with zero 
mean and variance 1. The input (design points uu) sat- 
isfies m = and u n+1 = (1/n) Yn=i u i + ( c / n ) Ya=i £ i- 
Then, one can prove that {0 LS }i — > Oi e «A an d 

~N a s - »N 

{6 LS } 2 ^ 9 2 - 1/c, N -» oo. That is, {0 iS }! con- 

verges to a random variable and {0 LS } 2 to a non-random 
constant different from 6 2 . The non-consistency of the 
LS estimator is due to the dependence of u n+ \ on pre- 
vious £j's, that is, to the presence of feedback control 
(in terms of DOE, the design is sequential). Although 
Mjv = X)»=i(1 Ui) T (l issuchthat A min (MAr) — > 00, 
it does not grows fast enough (in particular, the infor- 
mation matrix M.(U^) = Mjv/A^ tends to become sin- 
gular). Although this example might seem quite artifi- 
cial, one must notice that adaptive control as used e.g. 
in self-tuning strategies, may raise similar difficulties. 

7.1.1 ARX model and self-tuning regulator 

Consider a model with observations satisfying y k = 

aiyk-i H 1- a na Vk-n a + hiik-i + h b nb Uk- nb + £fc, 

which we can write yt = rJO + eu, with = 
(bi,b 2 ,...,ai,a 2 ,...) T and r fc = (u k -i, ■ ■ ■ , u k - nb , 
yk-i, ■ ■ ■ , Vk-n a ) T ■ The objective of minimum- variance 
control is to minimize Rn — J2k=i( v k ~ £fe) 2 - The in- 
put sequence (uk) is then said to be globally convergent 
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if R N /N as N -» oo, see [100], [101], [60]. If is 
known (with bi =/= 0) the optimal controller corresponds 
to u* k = -(oij/fc + • ■ • + a na yk+i-n a + b 2 u k -i + ■ • • + 
6nbUfc+i_ ni) )/6i. But then rj0 = for all k, the matrix 

M^v = J2k=i r k r k i s singular (since Mjv# = 0) and 
is not estimable. If certainty equivalence is forced by 

-k 

using at step k the optimal control calculated for LS , 
then additional perturbations must be introduced to 
guarantee that A m i n (Mjv) tends to infinity fast enough, 
see, e.g., [1]. Using a persistently exciting input u k , pos- 
sibly with optimal features via the approach of Section 
6, permits to avoid this difficulty but is in conflict with 
the global convergence property [100], in particular 
since ||0|| 2 A min (MAr) < R N , see [60]. 

7.1.2 Self-tuning optimizer 

Consider a linear regression model with observations 
yk = r T (u k )0 + The objective is to maximize a func- 
tion /(u, 0) with respect to u. If were known, the 
value u* = u*(0) = argmax u /(u, 0) could be used 
(for instance, u* =_ —9\/(29 2 ) when f(u,0) = 9 + 
9\u + 92U 2 ). Since is unknown, it must be estimated 
from the observations yk, k = 1, 2, . . . Again, the matrix 

Mjy = ^2k=i r ( u fe) rT ( u fe) is singular when the control 
is fixed, that is when Uk = u*(0) (constant) for all k, 
and is then not estimable. Suppose that forced cer- 
tainty equivalence is used with LS estimation, that is 

Ufe + i = u*(0 LS ). Perturbations should then be intro- 
duced to ensure consistency (e.g. randomly, see [22] for 
the quadratic case f(u, 0) = 9 + 9\u + #2W 2 ). The per- 
sistency of excitation is here in conflict with the perfor- 
mance objective (1/n) Y^7=i /( u i> ^) ~* f( u *,6), n — * 
oo. Self-tuning regulation of dynamical systems is con- 
sidered in [89] and [87] for time-continuous systems and 
in [32] for discrete-time systems. With a periodic distur- 
bance of magnitude a playing the role of a persistently 
exciting input signal, the output exponentially converges 
to a neighborhood 0(a 2 ) of the extremum. 

7.2 Nonlinear feedback control is not the answer 

Nonlinear-Feedback Control (NFC) offers a set of tech- 
niques for stabilizing systems with unknown parame- 
ters, see in particular the book [88]. The stability of the 
closed-loop is proved using Lyapunov techniques and, 
although not explicitly expressed in the construction of 
the feedback control, an estimator of the model parame- 
ters is obtained, which differs from standard estimation 
methods. At first sight one might think that NFC brings 
a suitable answer to adaptive control issues. However, 
stability is not consistency and it is the aim of this sec- 
tion to show that a direct application of NFC is bound 
to fail in the presence of random disturbances. Combin- 
ing NFC with more traditional estimation methods and 
suitably exciting perturbations then forms interesting 
perspectives, see Section 9. 



The presentation is made through (a slight modification 
of) one of the simplest examples in [88]. Consider the 
dynamical system (14), with known initial state xo and 
unknown parameter 9 € R. The problem is to construct 
a control u = u{t) that drives xto zero. (Notice that if 9 
were known, u = — (a + 9)x — 9 with a > would solve 
the problem since substitution in (14) gives the stable 
system x = —ax.) The following method is suggested 
in [88]: (i) construct an auxiliary controller that obeys 

9 = x(x + 1), (ii) consider 9 as an estimator of 9 and use 

FCE control with 6, that is, u = -(a+9)x-9,a > 0. The 
stability of this NFC can be checked through the behav- 
ior of the Lyapunov function V(x, 9) = x 2 /2 + (9 — 9) 2 /2. 
It satisfies V(x,9) = —ax 2 , which implies that x tends 
to zero, as required. Then, 9 + u tends to zero (from 
the expression of it) , and 9 + u also tends to zero (from 
Lasalle princ iple ). Therefore, the estimation error 9 — 9 
tends to zer d 15 1 . In the simulations that follow we sim- 
ply use a discretized Euler approximation of the differen- 
tial equation (14) and of the associated continuous-time 
controller, although it should be emphasized that some 
care is needed in general when implementing a digital 
controller on a continuous-time model, see, e.g., [122]. 
The discretization of (14) gives the recurrence equation 
(11), where Xk = x{kT) and Uk — u(kT). We take 9 — 1 
and xq — 1, the sampling period T is taken equal to 
0.01 s. (Notice that the open- loop system is unstable.) 
The NFC is discretized as 

§ k+1 = § k +Tx k (x k + l), 
u k = -{a + 9 k )x k -9 k , 

where 9 k = 9(kT). We take 9 Q = 2 and a = 1 s _1 . (Al- 
though the book [88] only concerns the stabilization of 
continuous-time systems, one can easily check that the 
fixed point Xk = 0, 6 k = 9 of the controlled discretized 
model above is Lyapunov-asymptotically stable.) Simu- 
lation results are presented in Figure 2. The initial de- 
crease of the state variable (solid line) is in agreement 
with the time-constant a -1 = 1 s and, for t > 8 s, the pa- 
rameter estimates (dashed line) and state become very 
close to the targets, respectively 9 = 1 and zero. 

Suppose now that the state is observed through yo = 
xq and yk = Xk + Sk for k > 1, where (e k ) denotes a 
sequence of i.i.d. errors normal J\f(0, a 2 ) (setting a 2 = 
S 2 T one may suppose for instance that e k is S times 
the integral of a realization of the standard Brownian 
motion between and T). We take a = xo/2 = 0.5, a 
rather extreme situation, to emphasize the influence of 
measurement errors. The evolutions of Xk (dash-dotted 



15 In the example on p. 3-4 of [88], x — u + 9x, 6 — x 2 
and u — — (a + 9)x, a > 0, so that x tends to zero but not 
necessarily the estimation error 9 — 9. 
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Fig. 2. Evolution of Xk (solid line) and 8k (dashed line) as 
functions of k for the system (11) with NFC (22) (8 = 1, 
Oo = 2, xo = 1, a = 1, sampling period T = 0.01 s). The 
curves in dash-dotted line and dotted line respectively show 
Xk and 8k when yk is substituted for Xk in (22). 

line) and 0k (dotted line) when yk is substituted for Xk in 
(22) are presented on Figure 2: the sequence of parameter 
estimates does not converge, the state fluctuates and is 
clearly not driven to zero. 



7.3 Some consistency results 

The difficulties encountered in Sections 7.1.1, 7.1.2 and 
7.2 are general in regulation-type problems: in order to 
satisfy the control objective, the input should asymptot- 
ically vanish, which does not bring enough excitation for 
guaranteeing the consistent estimation of the model pa- 
rameters. The control objective is thus in conflict with 
parameter estimation, and perturbations must be intro- 
duced. It is then of importance to know the minimal 
amount of perturbations required to ensure consistency 
of the estimator on which the control is based. Some re- 
sults are presented below for the case of linear regression. 

7.3.1 LS estimation 

Consider a linear regression model with observations 
yk = rjB + £k, and denote by Tk the cr-algebra gen- 
erated by the errors £!,...,£&. They are supposed 
to form a martingale difference sequence (ek is Tk-\ 
measurable and E{efc| Tk-\\ = 0) and to be such 
that sup fc E{e^| Tk-i\ < oo (with i.i.d. errors with 
zero mean and finite variance as a special case). Let 

m n = Ef=i r fe r fe , t hen M w ^ for iV ^ oo is 

-AT as - 

• sufficient for 9 LS —>■' 9 when the regressors are 
non-random constants, see [97], [98]; 

• necessary and sufficient if, moreover, the errors are £& 
i.i.d., 



but Mj, 1 



is not sufficient for 9 



LS 



9 if Tk is 



Tk-\ measurable (see the first example of Section 7.1). 

In the latter situation, a sufficient condition for 
9 LS 9 when N — ► oo is that A m i n (rvljv) oo and 



see [99]. In some sense, this is the best possible condi- 
tion: it is only marginally violated in the first example 
of Section 7.1, where [log A max (M,/v)]/A m in(MAr) tends 
a.s. to a random constant. Note that this condition is 
much weaker than the persistence of excitation which 
requires that Mjy grows at the same speed as N. 

7.3.2 Bayesian imbedding 

An even weaker condition is obtained for Bayesian esti- 
mation. Let 7r be a prior probability measure for 9 and 
denote by P the probability measure induced by the er- 
rors £/., k = 1, . . . , oo. Denote T' k the cr-algebra gener- 
ated by the observations y\, . . . , y% and suppose that 
is ^._ 1 -measurable. Suppose that the parameters are 

estimated by the posterior mean 9 B = lEj^lJ 7 ^} and 
denote by Cat = Var(0|J r ] v ) the posterior covariance 

matrix. Then, from martingale theory, 9 B and Cat both 
converge (tt x P)-a.s. when N — * oo, see [174], and all 
what is required for the (tt x P)-a.s. consistency of the 
estimator is Cat — > (tt x P)-a.s. Now, for a linear re- 
gression model with i.i.d. normal errors £& and a normal 
prior for 9, Bayesian estimation coincides with LS esti- 
mation (when the prior for 9 is suitably chosen), Cat is 

(tt x P)- 

9 (tt x P)-a.s. The re- 
quired condition is thus as weak as when the regressors 
Yk are non-random constants! Note, however, that the 
convergence is almost sure with respect to 9 having the 
prior 7r; that is, singular valu es of 9 may exist for which 
consistency does not holcP^I. 

This very powerful technique which analyses the prop- 
erties of LS estimation via a Bayesian approach is called 
Bayesian imbedding, see [174], [93]. Although in its orig- 
inal formulation it requires the measurement errors to 
be normal, the normality assumption is relaxed in [68] 
to the condition that the density tp of the errors is log- 
concave (d 2 log ip(t)/dt 2 < 0) and strictly positive with 
respect to the Lebesgue measure fi, the prior measure 
tt being absolutely continuous with respect to fi. More 
generally, the consistency of Bayesian estimators can 
be checked through the behavior of posterior covariance 
matrices, see [69]. Bayesian imbedding allows for easier 
proofs of consistency of the estimator, and permits to 
relax the conditions on the perturbations required to ob- 
tain consistency. This is illustrated below by revisiting 
the examples of Sections 7.1.1 and 7.1.2. 



proportional to M^ 1 and therefore, Mj, 1 

~N f>N 

a.s. is sufficient for 9 LS = 9 B 



16 In the first example of Section 7.1 is not T h -\- 
measurable since Uk is not obtained from previous observa- 
tions. Modify the control into u n +i = Qi + (02 /n) YL7=1 Ui + 
(c/n) Ysi=l 2/*> which is jF^-measurable. Then, 6 L g is not 
consistent when takes the particular value 81 = —ai/c, 
62 — (1 — 02) /c, so that the control coincides with previous 
one, u n+ i = (1/n) Ui + ( c / n ) Z)"=i £i - 
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Consider again the self-tuning regulator of Section 7.1.1. 
When LS estimation is used with forced certainty equiv- 
alence control, it is required to perturb the system to 
obtain a globally convergent input. It can be shown [94] 
that the control objective R n grows at least as log n, and 
randomly perturbed input sequences achieving this per- 
formance are proposed in [101]. Using Bayesian imbed- 
ding, global convergence can be obtained without the 
introduction of perturbations, see [93]. 

For the self-tuning optimizer of Section 7.1.2, Astrom 
and Wittenmark [2] have suggested a control of the 

type u k+ i = argmax u /(u,0 ) + a k d(u, where 
d(u,£) = r T (u)M- 1 (^)r(u) is the function (17) used 
in £>-optimal design, is the empirical measure of 
the inputs ui,...,Ufc and (a k ) is a sequence of pos- 
itive scalars. Note that d(u,£k)/fc = r T (u)M^ 1 r(u) 

with Mfe = Yli=i r ( u i) rT ( u i)- This strategy makes 
a compromise between optimization (maximization of 

f(u,0 ), for a k small) and estimation (D-optimal de- 
sign, for a k large). Using the results of Section 7.3.1, 
the following is proved in [135] for LS estimation. When 
the errors Ek form a martingale difference sequence 
with sup fc Wi{e^ k \Tk-\\ < oo, if (ak/k)\ogak is mono- 
tonically decreasing and ak/ (log k) 1+s monotonically 

~k a s — 

increases to infinity for some S > 0, then LS -V 0, 
(lA)Eti/K^) /( u *^) = max u /(u,0) and 
£,k — » <5 U * (weak convergence of probability measures) 
as k — > oo. That is, the LS estimator is strongly consis- 
tent, and at the same time the design points tend to 
concentrate at the optimal location u* . Using Bayesian 
imbedding, the same results are obtained when the con- 
ditions above on ak are relaxed to ak — > oo, ak/k — > 0, 
provided the errors Ek are i.i.d. Af(0, a 2 ), see [141]. 

7.4 Finite horizon: dynamic programming and dual 
control 

The presentation is for self-tuning optimization, but the 
problem is similar for other adaptive control situations. 
Suppose one wishes to maximize y^-i Wjf(ui, 0) for 
some sequence of positive weights u>i, with ^unknown 
and estimated through observations yi = r](6, Uj) + 
Let 7r denote a prior probability measure for and de- 
fine = (ui, . . . , u fe ), Y? = (y u ...,y k ) for all k. The 
problem to be solved can then be written as 

maxJE{wif(ui,0) + m&xTE{w 2 f (u 2 , 0) H 

Ul U 2 

max E{«;jv-i/(ujv-i, 0) 

UJV-1 

+ maxJE{w N f(u N , fl)!^" 1 ,^" 1 } 

" |^- a ,if- a }-|ui,yi}} (23) 

and thus corresponds to a Stochastic Dynamic Pro- 
gramming (SDP) problem. It is, in general, extremely 



difficult to solve due to the presence of imbedded maxi- 
mizations and expectations. The control has a dual 
effect (see e.g. [7]): it affects both the value of f(uk, 0) 
and the future uncertainty on through the posterior 
measures tt(0\UI,Y^), i> k. One of the main obstacles 
being the propagation of these measures, classical ap- 
proaches are based on their approximation. Consider 
stage k, where and Y* are known. Then: 

• Forced Certainty Equivalence control (FCE) replaces 
n(0\U{, YD for i > k (a "future posterior" for i > k), by 

~k 

the delta measure Sg k , where is the current estimated 
value of (see the examples of Sections 7.1.1 and 7.1.2); 

• Open-Loop-Feedback-Optimal control (OLFO) re- 
places -k(Q\U{, 17), i > k, by the current posterior mea- 
sure n(0\Ui, Yi) (moreover, most often this posterior 

is approximated by a normal distribution Af(0 , Cfe)). 

The FCE and OLFO control strategies can be con- 
sidered as passive since they ignore the influence of 
Ufc+i, Uk+2 ••• on the future posteriors n(0\U{,Y^), see, 
e.g., [179]. On the other hand, they yield a drastic sim- 
plification of the problem, since the approximation of 
n(0\U{,Y{) for i > k does not depend on the future 
observations yk+i, Uk+2 ■ ■ ■ This, and the fact that few 
active alternatives exist, explains their frequent usage. 

The active- control strategy suggested in [178] is based on 
a linearization around a nominal trajectory u(i) and ex- 
tended Kalman filtering. It does not seem to have been 
much employed, probably due to its rather high com- 
plexity. A modification of OLFO control is proposed in 
[137]. It takes a very simple form when the model re- 
sponse n(0,u) is linear in 0, that is, r](0, u) = r T (u)#, 
the errors are i.i.d. normal 7V(0, er 2 ) and the prior for is 
also normal. Then, at stage k, the posterior n(0\U{,Y}) 

is the normal Af(0 B , C k ) for i = k and can be approx- 

~k 

imated by Af(0 B , C,) for i > k, where B and Cfc are 
known (computed by classical recursive LS) and d fol- 
lows a recursion similar to that of recursive LS, 



Ci+i — Cj — 



Cir(u i+ i)r T (u i+ i)Ci 
er 2 + r T (ui + i)C 4 r(u J+ i) 



i > k. 



Note that C$ depends of u^, +1 , VLk+2 ■ ■ ■ , u i (which makes 
the strategy active), but not on yk+i, Uk+2 ■ ■ • (which 
makes it implemcntable). This method has been suc- 
cessfully applied to the adaptive control of model with 
a FIR, ARX, or state-space structure, see, e.g., [91], 
[92]. It requires, however, that the objective function 
/(u, 0) in (23) be non linear in to express the depen- 
dence in the covariance matrices Cj. Indeed, suppose 
that in the self-tuning optimization problem the func- 
tion to be maximized is the model response itself, that is, 

f(u,0) = r T (u)0. Then, IE{/(u, 6)\U{, Yf} = r T (u)<T B 

"k 

and using the approximation J\f(0 B ,Ci) for the future 
posteriors n(0\U{, Y±), i > k, one gets classical FCE con- 
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trol based on the Bayesian estimator 9 B . On the other 
hand, it is possible in that case to take benefit of the 
linearity of the function and obtain an approximation of 

E{max u r T (u)0^Vi V ~ 2 J y i JV_2 } for small a 2 , which 
can then be back-propagated; see [141] where a control 
strategy is given that is within 0(a 4 ) of the optimal (un- 
known) strategy u£ for the SDP problem (23). 

8 Sequential DOE 

Consider a nonlinear regression model for which 
the optimal design problem consists in minimizing 
^g(U^) = $[M F ([/f ,0)] for some criterion $, with 
unknown. In order to design an experiment adapted 
to 0, a natural approach consists in working sequen- 
tially. In full- sequential design, one support point 

-fc-i 

is introduced after each observation: is esti- 
mated from the data (Y{ 1 , U^ 1 ) and next Ufe min- 
imizes $[M F ({J7 1 fc - 1 ,u fc },0 )] (for D-optimal de- 
sign, this is equivalent to choosing that maximizes 
d§ h -i(u k ,£k-i) with de(u,£) the function (17) and £ fe _i 
the empirical measure for the design points in U^ 1 ). 
Note that it may be considered as a FCE control strat- 
egy, where the input (design point) at step k is based 

-fc-i 

on the current estimated value . For a finite hori- 
zon N (the number of observations), the problem is 
similar to that of Section 7.4 (self-tuning optimizer), 
with the design objective $[M F (J7 ] ' V , 0)] substituted for 

Si=i w if( u ii9)- Although the objective does not take 
an additive form, the problem is still of the SDP type, 
and active-control strategies can thus be constructed to 
approximate the optimal solution. However, they seem 
to only provide marginal improvements over tr adit ional 
passive strategies like FCE control, see e.g. [51] 17 1 

Although a sequentially designed experiment for the 
minimization of $ [M F (ET^ , 0)] aims at estimating 
with maximum possible precision, it is difficult to assess 

that 6 N ^ as N -> oo (and thus £ N ^ £*(0), with 
£*(0) the optimal design for 0) when full-sequential de- 
sign is used; see [191] for a simple example (with a posi- 
tive answer) for LS estimation. When full-sequential de- 
sign is based on Bayesian estimation (posterior mean), 
strong consistency can be proved if the optimal design 
£*(0) satisfies an identifiability condition for any 0, see 
[70] (this is related to Bayesian imbedding considered in 
Section 7.3.2). The asymptotic analysis of multi-stage 
sequential design is considered in [28] and the construc- 
tion of asymptotically optimal sequential design strate- 
gies in [172], where it is shown that using two stages is 



17 An active strategy aims at taking into account the influ- 
ence of current decisions on the future precision of estimates; 
in that sense DOE is naturally active by definition, even if 
based on FCE control. Trying to make sequential DOE more 
active is thus doomed to small improvements 



enough. Practical experience tends to confirm the good 
performance of two-stage procedures, see, e.g., [8]. 

9 Concluding remarks and perspectives in DOE 

Correlated errors. Few results exist on DOE in the pres- 
ence of correlated observations and one can refer e.g. to 
[127], [117], [45] and the monograph [116] for recent de- 
velopments. See also Section 3.2.2. The situation is dif- 
ferent in the adaptive control community where corre- 
lated errors are classical, see Section 7.3.1 (for instance, 
the paper [123] gives results on strong laws of large num- 
bers for correlated sequences of random variables under 
rather common assumptions in signal or control applica- 
tions) , which calls for appropriate developments in DOE. 
Notice that when the correlation of the error process 
decays at hyperbolic rate (long-range dependence), the 
asymptotic theory of parameter estimation in regression 
models (Section 4) must itself be revisited, see, e.g., [73]. 

Nonlinear models. The presentation in Sections 6 and 7 
has concerned models with a linear dynamic (e.g. with a 
Box and Jenkins structure), but models that corresponds 
to nonlinear differential or recurrence equations raise no 
special difficulty for the construction of the Fisher infor- 
mation matrix (Section 6), which can always be obtained 
through simulations. The main issue concerns linearity 
with respect to the model parameters 0. In particular, 
few results exist concerning the extension of the results 
of Section 7.3 to models with a nonlinear parametriza- 
tion (see [95] for LS estimation and [71] for results on 
Bayesian imbedding when has a discrete prior). 

DOE without persistence of excitation. In the context 
of self-tuning regulation, we mentioned in Section 7.3.2 
that random perturbations may be added to certainty 
equivalence control based on LS estimation to guarantee 
the strong consistency of parameter estimates and the 
asymptotically optimal growth of the control objective, 
see [101]. This is an example of a situation where "non- 
stationary experiments" could be designed in order to 
replace random perturbations by inputs with suitable 
spectrum and asymptotically vanishing amplitude. In 
the same vein, the modified OLFO control proposed in 
[137] and the small-noise approximation of [141] (de- 
signed for self-tuning optimization, but extendable to 
self-tuning regulation) make a good compromise be- 
tween exploration and exploitation when the horizon 
is finite (see Section 7.4). An asymptotic analysis for 
the horizon tending to infinity could permit to design 
simpler non-stationary strategies. 

Nonparametric models, active learning and control. 
Strategies are called active in opposition to passive ones 
that collect data "as they come". DOE is thus intrin- 
sically active, and its use in learning leads to methods 
that try to select training samples instead of taking 
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them randomly. Although its usefulness is now well per- 
ceived in the statistical learning community, it is still at 
an early stage of development due to the complexity of 
DOE for nonparametric models. More generally, active 
strategies are valuable each time actions or decisions 
have a dual effect and a compromise should be made 
between exploration and exploitation: exploration may 
be done at random, but better performance is achieved 
when it is carefully planned. For instance, active strate- 
gies connected with Markov decision theory could yield 
improvements in reinforcement learning, see e.g. the 
survey [80]. 

Linking nonparametric estimation with control forms a 
quite challenging area, where the issues raised by the 
estimation of the function that defines the dynamics of 
the system come in addition to those, more classical, of 
adaptive control with parametric models, see, e.g., [134], 
[133] for emerging developments. 

Algorithms for optimal DOE. The importance of con- 
structing criteria for DOE in relation with the intended 
objective has been underlined in Section 6.2 where cri- 
teria of the minimax type have been introduced from 
robust-control considerations. (Minimax-optimal design 
is also an efficient method to face the dependence of lo- 
cal optimal design in the unknown value of the model 
parameters, see Section 5.3.5.) Although the minimax 
problem can often be formulated as a convex one, some- 
times with a finite number of constraints, the develop- 
ment of specific algorithms would be much profitable to 
the diffusion of the methodology, in the same way as the 
classical design algorithms of Sections 5.2 and 5.3.3 have 
contributed to the diffusion of optimal DOE outside the 
statistical community where it originated. 

Another view on global optimization. Let /(u) be a func- 
tion to be maximized with respect to u in some given 
set U; it is not assumed to be concave, nor is the set 
U assumed to be convex, so that local maxima may ex- 
ist. The function can be evaluated at any given input 
Ui e U, which gives an "observation" y(u,) = /(uj). In 
engineering applications where the evaluation of / cor- 
responds to the execution of a large simulation code, ex- 
pensive in terms of computing time, it is of paramount 
importance to use an optimization method parsimonious 
in terms of number of function evaluations. This en- 
ters into the framework of computer experiments, where 
Kriging is now a rather well-established tool for mod- 
elling, see Section 3.1. Using a Bayesian point of view, 
the value /(u) after the collection of the data T>k = 
{[ui, y(ui)], . . . , [ufe, y(ufe)]} can be considered as dis- 
tributed with the density ip(y\T>k, u) of the normal dis- 
tribution N(yv k (u), pj, k (u)), where yx>(u) and p|>(u) 
are respectively given by (3) and (4). An optimization 
algorithm that uses this information should then make 
a compromise between exploration (trying to reduce the 
MSE p\) (u) by placing observations at values of u where 
p|j (u) is large) and exploitation (trying to maximize the 



expected response yx>(u)). A rather intuitive method is 
to choose Ufc+i that maximizes yx> k (u) + ctpx> k ( u ) for 
some positive constant a, see [35]. In theory, Stochas- 
tic Dynamic Programming could be used to find the op- 
timal strategy (or algorithm) to maximize /(u): when 
the number TV of evaluations is given in advance, the 
problem takes the same form as in (23) with Wi = for 
i = 1,. . . ,N — 1. However, in practise this SDP prob- 
lem is much too difficult to solve, and approximations 
must be used to define suboptimal searching rules. For 
instance, one may use a one-step-ahead approach and 
choose the input Ufc + i that maximizes the expected im- 
provement EI(u) = CSax [y - y™ ax ] ip(y[D k ,u)dy, with 

y k 

y™ ax = max{y(ui), . . . , y(u^)}, the maximum value of 
/ observed so far, see [110], [109], [161]. The function 
/ is then evaluated at Ufc+i, the Kriging model is up- 
dated (although not necessarily at each iteration), and 
similar steps are repeated. Each iteration of the result- 
ing algorithm requires one evaluation of / and the solu- 
tion of another global optimization problem, for which 
any ad-hoc global search algorithm can be used (the op- 
timization concerns the function EI, which is easier to 
evaluate than /). For instance, it is suggested in [11] to 
update a Delaunay triangulation of the set U based on 
the vertices u i7 i = 1, . . . , k, and to perform the global 
search for the maximization of EI(u) by initializing lo- 
cal searches at the centers of the Delaunay cells. Note 
that the algorithm tends asymptotically to observe ev- 
erywhere in U, since the expected improvement EI(u) 
is always strictly positive at any value u where no obser- 
vation has been taken yet. However, a credible stopping 
rule is given by the criterion itself: it is reasonable to stop 
when the expected improvement becomes small enough. 
One can refer to [161], [79] for detailed implementations, 
including problems with constraints also defined by sim- 
ulation codes. Derivative information on / can be in- 
cluded in the Kriging model, as indicated in Section 3.1, 
and thus used by the optimization algorithm, see [102]. 
It seems that suboptimal searching rules looking further 
than one-step-ahead have never been used, which forms a 
rather challenging objective for active control. Also, the 
one-step-ahead method above is completely passive with 
respect to the estimation of the parameters of the Krig- 
ing model, and active strategies (even one-step-ahcad) 
are still to be designed, see Section 3.2.2. Note finally 
that the definition of the expected improvement EI is 
not adapted to the presence of errors in the evaluation 
of /, so that further developments are required for sit- 
uations where one optimizes the observed response of a 
real physical process. 

NFC, FCE, estimating functions and DOE. Consider 
again the example of NFC in Section 7.2, in the case 
where the state Xk is observed though y k = x k + £k 
and yk is substituted for Xk in (22). As shown in Fig- 
ure 2, the NFC estimator 9k is then not consistent. On 
the other hand, in the same situation more classical es- 
timation techniques have satisfactory behaviors (hence, 
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in terms of DOE, NFC brings enough information to 
estimate the unknown 8). Consider for instance the LS 
estimator of 8 in (11), obtained from yi,.. ■ ,Vk- Since 
x k is no nlinear in 8, recursive LS cannot be used di- 
recthvl 18 1, but the estimation becomes almost recursive 
using a stochastic-Newton algorithm, see, e.g., [188], p. 
208. Figure 3 (top) shows that the corresponding esti- 
mator 8 h converges quickly to the true value 9 = 1. The 
evolution of the estimator (13) obtained from an estimat- 
ing function is presented on the same figure (bottom). 

Its convergence is slower than that of 8 k , due in particu- 
lar to the presence of the term y k /(kT) in the numerator 
of (13), but its construction is much simpler. An impor- 
tant consequence is that the analysis of its asymptotic 
behavior in an adaptive-control framework is easier than 
for LS estimation: 8 k is a consistent estimator of 8 when 
(1/k) J2i=i x i IS bounded away from —1 (which is the 
case since the control drives this quantity to zero) . Sim- 
ulations confirm that when applying the FCE controller 
u k = -(a + 8 k )x k (8 k )-8 k to (11), with x k {8 k ) obtained 
by substituting 8 k for 8 in (11), 8 k converges to 8 and the 
state x k is correctly driven to zero. Simulations show, 
however, that the dynamic of the state is slower than for 
NFC; it is thus tempting to combine both strategies. For 
instance, one could use FCE control based on 8 k when 
the standard deviation of 8 k is smaller than some pre- 
scribed value, and use NFC otherwise. The simulation 
results obtained with such a switching strategy are en- 
couraging and indicate that the combination of different 
estimation methods may improve the performance of the 
controller. At the same time, this raises more issues than 
it brings answers. Some are listed below. 









5 200 400 600 BOO 1000 1200 1400 1600 1800 2000 

k 









200 400 600 800 1000 1200 1400 1600 1800 2000 



k 

Fig. 3. Top: evolution of the LS estimator 9 , bottom: evo- 
lution of 6 k given by (13). 



The situation would be much easier for the autoregres- 
sive model y i+1 = + T[m + 9(yi + 1)] + £i+i where 8 
could be estimated by recursive LS. Note that this model 
can be considered as resulting from the discretization of 
y = u + 8(y + 1) + SdBt(uj), with Bt(u>) the standard Brow- 
nian motion (starting at zero with variance 1), which cor- 
responds to the introduction of process noise into (14), and 
gives e»+i = S[B (i+1)T (cu) - B iT (u))]. 



Combining NFC, which relies on Lyapunov stability, 
with simple predictors, e.g. based on FCE, while preserv- 
ing stability, forms an interesting challenge for which re- 
sults on input-to-state stabilizing control could be used 
(see, e.g., Chapters 5 and 6 of [88] for continuous-time 
and [121] for discrete-time control) . In classical FCE con- 
trol the consistency of the estimator is a major issue. 
Using suitable estimating functions could then lead to 
fruitful developments, due the flexibility of the method 
and the simplicity of the associated estimators. Suitably 
designed perturbations could be introduced for helping 
the estimation, possibly following developments simi- 
lar to those that lead to the active-control strategies 
of Sections 7.3.2 and 7.4. At the same time, the per- 
turbed control should not endanger the stability of the 
system. Designing input sequences (possibly vanishing 
with time) that bring maximum information for estima- 
tion subject to a stability constraint forms an unusual 
and challenging DOE problem. Finally, as a first step 
towards the design of robust-and-adaptive controllers 
mentioned at the end of Section 6.2, one may replace 
a traditional FCE controller by one that gives the best 
performance for the worst model in the current uncer- 
tainty set (roughly speaking, for the self-tuning prob- 
lem considered in Section 7.4 this amounts to replacing 
expectations in (23) by minimizations with respect to 
in the current uncertainty set). The determination of 
active-control strategies for such minimax (dynamical 
games) problems seems to be a promising direction for 
developments in adaptive control. 
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